Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [http://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Online News Popularity dataset presents a binary classification problem in which we try to predict one of two possible outcomes.

INTRODUCTION: This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict an article’s popularity level in social networks. The dataset does not contain the original content, only statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.

Many thanks to K. Fernandes, P. Vinagre, and P. Cortez for making the dataset and benchmarking information available: "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September 2015, Coimbra, Portugal.

In iteration Take1, the script focused on evaluating various machine learning algorithms and identifying the algorithm that produces the best accuracy result. Iteration Take1 established a baseline performance regarding accuracy and processing time.

For this iteration, we will examine the feasibility of a dimensionality reduction technique: ranking attribute importance with a gradient boosting tree method. Afterward, we will retain only the features that contribute to a cumulative importance of 0.99 (99%) and eliminate the rest.
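The cumulative-importance cutoff can be sketched in base R. The importance scores below are made-up values for illustration only; in the actual modeling they would come from the fitted gradient boosting model (e.g. via caret's varImp), normalized so they sum to one.

```r
# Hypothetical relative importance scores for a handful of this dataset's
# attributes (values are illustrative, not model output)
importance <- c(kw_avg_avg = 0.45, LDA_02 = 0.25,
                self_reference_min_shares = 0.15, num_hrefs = 0.10,
                is_weekend = 0.043, title_subjectivity = 0.005,
                n_tokens_title = 0.002)

# Rank descending, accumulate, and keep the smallest set of features
# whose cumulative importance reaches the 0.99 threshold
ranked <- sort(importance, decreasing = TRUE)
cumImportance <- cumsum(ranked) / sum(ranked)
keepFeatures <- names(ranked)[seq_len(min(which(cumImportance >= 0.99)))]
keepFeatures
```

With these illustrative scores, the last two features fall outside the 99% cutoff and would be dropped before re-running the modeling.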

ANALYSIS: From the previous iteration, Take1, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.53%. Three algorithms (Random Forest, AdaBoost, and Stochastic Gradient Boosting) achieved the top three accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, with an average accuracy of 67.48%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.71%, just slightly below the training accuracy.

In the current iteration, the baseline performance of the machine learning algorithms achieved an average accuracy of 64.29%. Two ensemble algorithms (Random Forest and Stochastic Gradient Boosting) achieved the top accuracy scores after the first round of modeling. After a series of tuning trials, Stochastic Gradient Boosting turned in the top result on the training data, with an average accuracy of 67.51%. Using the optimized tuning parameters, the Stochastic Gradient Boosting algorithm processed the validation dataset with an accuracy of 66.53%, just slightly below the training accuracy.

Through the model-building activities, the number of attributes went from 58 down to 42 after eliminating 16 attributes. The processing time went from 6 hours 31 minutes in iteration Take1 down to 3 hours 18 minutes in iteration Take2, a reduction of 49%.
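The reported 49% figure can be checked with simple arithmetic:

```r
# Verify the processing-time reduction between the two iterations
take1Minutes <- 6 * 60 + 31   # 6 hours 31 minutes = 391 minutes
take2Minutes <- 3 * 60 + 18   # 3 hours 18 minutes = 198 minutes
reduction <- (take1Minutes - take2Minutes) / take1Minutes
round(reduction * 100)        # 49 (percent)
```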

CONCLUSION: The feature selection technique cut down the number of attributes and reduced the training time. The modeling took much less time to process yet retained a comparable level of accuracy. For this dataset, the Stochastic Gradient Boosting algorithm and the attribute importance ranking technique should be considered for further modeling or production use.

Dataset Used: Online News Popularity Dataset

Dataset ML Model: Binary classification with numerical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

One potential source of performance benchmarks: [Benchmark URL - https://www.kaggle.com/uciml/pima-indians-diabetes-database]

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options.
  3. Explore non-ensemble and ensemble algorithms for baseline model performance.
  4. Explore algorithm tuning techniques for improving model performance.

Any predictive modeling machine learning project can generally be broken down into six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(parallel)
library(mailR)

# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Load dataset

originalDataset <- read.csv("OnlineNewsPopularity.csv", header= TRUE)

# Using the "shares" column to set up the target variable column
# targetVar <- 0 when shares < 1400, targetVar <- 1 when shares >= 1400
originalDataset$targetVar <- 0
originalDataset$targetVar[originalDataset$shares>=1400] <- 1
originalDataset$targetVar <- as.factor(originalDataset$targetVar)
originalDataset$shares <- NULL

# Dropping the two non-predictive attributes: url and timedelta
originalDataset$url <- NULL
originalDataset$timedelta <- NULL

# Different ways of reading and processing the input dataset. Keeping these for future reference.
#x_train <- read.fwf("X_train.txt", widths = widthVector, col.names = colNames)
#y_train <- read.csv("y_train.txt", header = FALSE, col.names = c("targetVar"))
#y_train$targetVar <- as.factor(y_train$targetVar)
#xy_train <- cbind(x_train, y_train)
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(originalDataset)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If (targetCol != 1) and (targetCol != totCol), be aware when slicing up the dataframes for visualization!
targetCol <- totCol
#colnames(originalDataset)[targetCol] <- "targetVar"
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
xy_train <- originalDataset[training_index,]
xy_test <- originalDataset[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[, (targetCol+1):totCol]
  y_train <- xy_train[, targetCol]
  y_test <- xy_test[, targetCol]
} else {
  x_train <- xy_train[, 1:totAttr]
  y_train <- xy_train[, totCol]
  y_test <- xy_test[, totCol]
}

1.c) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 8
if (totAttr %% dispCol == 0) {
  dispRow <- totAttr %/% dispCol
} else {
  dispRow <- (totAttr %/% dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  8  by  8

1.d) Set test options and evaluation metric

# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1)
metricTarget <- "Accuracy"
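The trainControl call above requests 10-fold cross-validation (with repeats=1, i.e. plain 10-fold). As a base-R sketch of what such a split does under the hood, assuming the 27,751-row training set: each observation is assigned to exactly one fold, and each fold serves once as the hold-out set while the other nine folds train the model.

```r
# Base-R sketch of a single 10-fold assignment like the one caret builds
n <- 27751                                  # rows in xy_train
set.seed(888)
folds <- sample(rep(1:10, length.out = n))  # random fold label per row
holdout1 <- which(folds == 1)               # hold-out indices for fold 1
length(holdout1)                            # roughly n/10 observations
```

In practice caret handles this internally (and stratifies the folds by class), so this sketch is only meant to illustrate the resampling scheme being evaluated.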

1.e) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2aafb23c}"

2. Summarize Data

To gain a better understanding of the data that we have on hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##    n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words
## 2               9              255       0.6047431                1
## 3               9              211       0.5751295                1
## 4               9              531       0.5037879                1
## 7               8              960       0.4181626                1
## 10             10              231       0.6363636                1
## 11              9             1248       0.4900498                1
##    n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos
## 2                 0.7919463         3              1        1          0
## 3                 0.6638655         3              1        1          0
## 4                 0.6656347         9              0        1          0
## 7                 0.5498339        21             20       20          0
## 10                0.7971014         4              1        1          1
## 11                0.7316384        11              0        1          0
##    average_token_length num_keywords data_channel_is_lifestyle
## 2              4.913725            4                         0
## 3              4.393365            6                         0
## 4              4.404896            7                         0
## 7              4.654167           10                         1
## 10             5.090909            5                         0
## 11             4.617788            8                         0
##    data_channel_is_entertainment data_channel_is_bus
## 2                              0                   1
## 3                              0                   1
## 4                              1                   0
## 7                              0                   0
## 10                             0                   0
## 11                             0                   0
##    data_channel_is_socmed data_channel_is_tech data_channel_is_world
## 2                       0                    0                     0
## 3                       0                    0                     0
## 4                       0                    0                     0
## 7                       0                    0                     0
## 10                      0                    0                     1
## 11                      0                    0                     1
##    kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max
## 2           0          0          0          0          0          0
## 3           0          0          0          0          0          0
## 4           0          0          0          0          0          0
## 7           0          0          0          0          0          0
## 10          0          0          0          0          0          0
## 11          0          0          0          0          0          0
##    kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares
## 2           0          0          0                         0
## 3           0          0          0                       918
## 4           0          0          0                         0
## 7           0          0          0                       545
## 10          0          0          0                         0
## 11          0          0          0                         0
##    self_reference_max_shares self_reference_avg_sharess weekday_is_monday
## 2                          0                      0.000                 1
## 3                        918                    918.000                 1
## 4                          0                      0.000                 1
## 7                      16000                   3151.158                 1
## 10                         0                      0.000                 1
## 11                         0                      0.000                 1
##    weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
## 2                   0                    0                   0
## 3                   0                    0                   0
## 4                   0                    0                   0
## 7                   0                    0                   0
## 10                  0                    0                   0
## 11                  0                    0                   0
##    weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend
## 2                  0                   0                 0          0
## 3                  0                   0                 0          0
## 4                  0                   0                 0          0
## 7                  0                   0                 0          0
## 10                 0                   0                 0          0
## 11                 0                   0                 0          0
##        LDA_00     LDA_01     LDA_02     LDA_03     LDA_04
## 2  0.79975569 0.05004668 0.05009625 0.05010067 0.05000071
## 3  0.21779229 0.03333446 0.03335142 0.03333354 0.68218829
## 4  0.02857322 0.41929964 0.49465083 0.02890472 0.02857160
## 7  0.02008167 0.11470539 0.02002437 0.02001533 0.82517325
## 10 0.04000010 0.04000003 0.83999721 0.04000063 0.04000204
## 11 0.02500356 0.28730114 0.40082932 0.26186375 0.02500223
##    global_subjectivity global_sentiment_polarity
## 2            0.3412458                0.14894781
## 3            0.7022222                0.32333333
## 4            0.4298497                0.10070467
## 7            0.5144803                0.26830272
## 10           0.3138889                0.05185185
## 11           0.4820598                0.10235015
##    global_rate_positive_words global_rate_negative_words
## 2                  0.04313725                0.015686275
## 3                  0.05687204                0.009478673
## 4                  0.04143126                0.020715631
## 7                  0.08020833                0.016666667
## 10                 0.03896104                0.030303030
## 11                 0.03846154                0.020833333
##    rate_positive_words rate_negative_words avg_positive_polarity
## 2            0.7333333           0.2666667             0.2869146
## 3            0.8571429           0.1428571             0.4958333
## 4            0.6666667           0.3333333             0.3859652
## 7            0.8279570           0.1720430             0.4020386
## 10           0.5625000           0.4375000             0.2984127
## 11           0.6486486           0.3513514             0.4044801
##    min_positive_polarity max_positive_polarity avg_negative_polarity
## 2             0.03333333                   0.7            -0.1187500
## 3             0.10000000                   1.0            -0.4666667
## 4             0.13636364                   0.8            -0.3696970
## 7             0.10000000                   1.0            -0.2244792
## 10            0.10000000                   0.5            -0.2380952
## 11            0.10000000                   1.0            -0.4150641
##    min_negative_polarity max_negative_polarity title_subjectivity
## 2                 -0.125            -0.1000000                  0
## 3                 -0.800            -0.1333333                  0
## 4                 -0.600            -0.1666667                  0
## 7                 -0.500            -0.0500000                  0
## 10                -0.500            -0.1000000                  0
## 11                -1.000            -0.1000000                  0
##    title_sentiment_polarity abs_title_subjectivity
## 2                         0                    0.5
## 3                         0                    0.5
## 4                         0                    0.5
## 7                         0                    0.5
## 10                        0                    0.5
## 11                        0                    0.5
##    abs_title_sentiment_polarity targetVar
## 2                             0         0
## 3                             0         1
## 4                             0         0
## 7                             0         0
## 10                            0         0
## 11                            0         1

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 27751    59

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##                n_tokens_title              n_tokens_content 
##                     "numeric"                     "numeric" 
##               n_unique_tokens              n_non_stop_words 
##                     "numeric"                     "numeric" 
##      n_non_stop_unique_tokens                     num_hrefs 
##                     "numeric"                     "numeric" 
##                num_self_hrefs                      num_imgs 
##                     "numeric"                     "numeric" 
##                    num_videos          average_token_length 
##                     "numeric"                     "numeric" 
##                  num_keywords     data_channel_is_lifestyle 
##                     "numeric"                     "numeric" 
## data_channel_is_entertainment           data_channel_is_bus 
##                     "numeric"                     "numeric" 
##        data_channel_is_socmed          data_channel_is_tech 
##                     "numeric"                     "numeric" 
##         data_channel_is_world                    kw_min_min 
##                     "numeric"                     "numeric" 
##                    kw_max_min                    kw_avg_min 
##                     "numeric"                     "numeric" 
##                    kw_min_max                    kw_max_max 
##                     "numeric"                     "numeric" 
##                    kw_avg_max                    kw_min_avg 
##                     "numeric"                     "numeric" 
##                    kw_max_avg                    kw_avg_avg 
##                     "numeric"                     "numeric" 
##     self_reference_min_shares     self_reference_max_shares 
##                     "numeric"                     "numeric" 
##    self_reference_avg_sharess             weekday_is_monday 
##                     "numeric"                     "numeric" 
##            weekday_is_tuesday          weekday_is_wednesday 
##                     "numeric"                     "numeric" 
##           weekday_is_thursday             weekday_is_friday 
##                     "numeric"                     "numeric" 
##           weekday_is_saturday             weekday_is_sunday 
##                     "numeric"                     "numeric" 
##                    is_weekend                        LDA_00 
##                     "numeric"                     "numeric" 
##                        LDA_01                        LDA_02 
##                     "numeric"                     "numeric" 
##                        LDA_03                        LDA_04 
##                     "numeric"                     "numeric" 
##           global_subjectivity     global_sentiment_polarity 
##                     "numeric"                     "numeric" 
##    global_rate_positive_words    global_rate_negative_words 
##                     "numeric"                     "numeric" 
##           rate_positive_words           rate_negative_words 
##                     "numeric"                     "numeric" 
##         avg_positive_polarity         min_positive_polarity 
##                     "numeric"                     "numeric" 
##         max_positive_polarity         avg_negative_polarity 
##                     "numeric"                     "numeric" 
##         min_negative_polarity         max_negative_polarity 
##                     "numeric"                     "numeric" 
##            title_subjectivity      title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                     "numeric"                     "numeric" 
##                     targetVar 
##                      "factor"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##  n_tokens_title n_tokens_content n_unique_tokens  n_non_stop_words
##  Min.   : 3.0   Min.   :   0.0   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 9.0   1st Qu.: 247.0   1st Qu.:0.4703   1st Qu.:1.0000  
##  Median :10.0   Median : 411.0   Median :0.5389   Median :1.0000  
##  Mean   :10.4   Mean   : 549.5   Mean   :0.5301   Mean   :0.9704  
##  3rd Qu.:12.0   3rd Qu.: 720.0   3rd Qu.:0.6078   3rd Qu.:1.0000  
##  Max.   :23.0   Max.   :8474.0   Max.   :1.0000   Max.   :1.0000  
##  n_non_stop_unique_tokens   num_hrefs      num_self_hrefs   
##  Min.   :0.0000           Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:0.6254           1st Qu.:  4.00   1st Qu.:  1.000  
##  Median :0.6905           Median :  8.00   Median :  2.000  
##  Mean   :0.6726           Mean   : 10.94   Mean   :  3.296  
##  3rd Qu.:0.7544           3rd Qu.: 14.00   3rd Qu.:  4.000  
##  Max.   :1.0000           Max.   :304.00   Max.   :116.000  
##     num_imgs         num_videos     average_token_length  num_keywords   
##  Min.   :  0.000   Min.   : 0.000   Min.   :0.000        Min.   : 1.000  
##  1st Qu.:  1.000   1st Qu.: 0.000   1st Qu.:4.479        1st Qu.: 6.000  
##  Median :  1.000   Median : 0.000   Median :4.666        Median : 7.000  
##  Mean   :  4.557   Mean   : 1.262   Mean   :4.550        Mean   : 7.214  
##  3rd Qu.:  4.000   3rd Qu.: 1.000   3rd Qu.:4.855        3rd Qu.: 9.000  
##  Max.   :128.000   Max.   :91.000   Max.   :7.696        Max.   :10.000  
##  data_channel_is_lifestyle data_channel_is_entertainment
##  Min.   :0.00000           Min.   :0.0000               
##  1st Qu.:0.00000           1st Qu.:0.0000               
##  Median :0.00000           Median :0.0000               
##  Mean   :0.05236           Mean   :0.1786               
##  3rd Qu.:0.00000           3rd Qu.:0.0000               
##  Max.   :1.00000           Max.   :1.0000               
##  data_channel_is_bus data_channel_is_socmed data_channel_is_tech
##  Min.   :0.0000      Min.   :0.0000         Min.   :0.0000      
##  1st Qu.:0.0000      1st Qu.:0.0000         1st Qu.:0.0000      
##  Median :0.0000      Median :0.0000         Median :0.0000      
##  Mean   :0.1591      Mean   :0.0569         Mean   :0.1865      
##  3rd Qu.:0.0000      3rd Qu.:0.0000         3rd Qu.:0.0000      
##  Max.   :1.0000      Max.   :1.0000         Max.   :1.0000      
##  data_channel_is_world   kw_min_min       kw_max_min       kw_avg_min     
##  Min.   :0.0000        Min.   : -1.00   Min.   :     0   Min.   :   -1.0  
##  1st Qu.:0.0000        1st Qu.: -1.00   1st Qu.:   445   1st Qu.:  140.6  
##  Median :0.0000        Median : -1.00   Median :   660   Median :  234.5  
##  Mean   :0.2131        Mean   : 25.78   Mean   :  1161   Mean   :  312.6  
##  3rd Qu.:0.0000        3rd Qu.:  4.00   3rd Qu.:  1000   3rd Qu.:  355.8  
##  Max.   :1.0000        Max.   :318.00   Max.   :298400   Max.   :42827.9  
##    kw_min_max       kw_max_max       kw_avg_max       kw_min_avg  
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :  -1  
##  1st Qu.:     0   1st Qu.:843300   1st Qu.:173445   1st Qu.:   0  
##  Median :  1500   Median :843300   Median :245217   Median :1039  
##  Mean   : 13848   Mean   :753637   Mean   :260169   Mean   :1124  
##  3rd Qu.:  7900   3rd Qu.:843300   3rd Qu.:331582   3rd Qu.:2063  
##  Max.   :843300   Max.   :843300   Max.   :843300   Max.   :3610  
##    kw_max_avg       kw_avg_avg    self_reference_min_shares
##  Min.   :     0   Min.   :    0   Min.   :     0           
##  1st Qu.:  3557   1st Qu.: 2387   1st Qu.:   640           
##  Median :  4353   Median : 2872   Median :  1200           
##  Mean   :  5664   Mean   : 3139   Mean   :  4012           
##  3rd Qu.:  6020   3rd Qu.: 3600   3rd Qu.:  2600           
##  Max.   :298400   Max.   :43568   Max.   :843300           
##  self_reference_max_shares self_reference_avg_sharess weekday_is_monday
##  Min.   :     0            Min.   :     0             Min.   :0.0000   
##  1st Qu.:  1100            1st Qu.:   986             1st Qu.:0.0000   
##  Median :  2800            Median :  2200             Median :0.0000   
##  Mean   : 10439            Mean   :  6434             Mean   :0.1673   
##  3rd Qu.:  8000            3rd Qu.:  5200             3rd Qu.:0.0000   
##  Max.   :843300            Max.   :843300             Max.   :1.0000   
##  weekday_is_tuesday weekday_is_wednesday weekday_is_thursday
##  Min.   :0.0000     Min.   :0.000        Min.   :0.0000     
##  1st Qu.:0.0000     1st Qu.:0.000        1st Qu.:0.0000     
##  Median :0.0000     Median :0.000        Median :0.0000     
##  Mean   :0.1867     Mean   :0.188        Mean   :0.1822     
##  3rd Qu.:0.0000     3rd Qu.:0.000        3rd Qu.:0.0000     
##  Max.   :1.0000     Max.   :1.000        Max.   :1.0000     
##  weekday_is_friday weekday_is_saturday weekday_is_sunday   is_weekend    
##  Min.   :0.0000    Min.   :0.0000      Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000    1st Qu.:0.0000      1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000    Median :0.0000      Median :0.00000   Median :0.0000  
##  Mean   :0.1446    Mean   :0.0623      Mean   :0.06886   Mean   :0.1312  
##  3rd Qu.:0.0000    3rd Qu.:0.0000      3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000    Max.   :1.0000      Max.   :1.00000   Max.   :1.0000  
##      LDA_00            LDA_01            LDA_02            LDA_03       
##  Min.   :0.01818   Min.   :0.01818   Min.   :0.01818   Min.   :0.01818  
##  1st Qu.:0.02505   1st Qu.:0.02501   1st Qu.:0.02857   1st Qu.:0.02857  
##  Median :0.03339   Median :0.03335   Median :0.04000   Median :0.04000  
##  Mean   :0.18486   Mean   :0.14082   Mean   :0.21579   Mean   :0.22336  
##  3rd Qu.:0.24068   3rd Qu.:0.15029   3rd Qu.:0.33307   3rd Qu.:0.37331  
##  Max.   :0.92699   Max.   :0.92595   Max.   :0.92000   Max.   :0.92653  
##      LDA_04        global_subjectivity global_sentiment_polarity
##  Min.   :0.01818   Min.   :0.0000      Min.   :-0.38021         
##  1st Qu.:0.02857   1st Qu.:0.3964      1st Qu.: 0.05823         
##  Median :0.04124   Median :0.4540      Median : 0.11958         
##  Mean   :0.23517   Mean   :0.4435      Mean   : 0.11970         
##  3rd Qu.:0.40332   3rd Qu.:0.5083      3rd Qu.: 0.17795         
##  Max.   :0.92719   Max.   :1.0000      Max.   : 0.65500         
##  global_rate_positive_words global_rate_negative_words rate_positive_words
##  Min.   :0.00000            Min.   :0.000000           Min.   :0.0000     
##  1st Qu.:0.02843            1st Qu.:0.009615           1st Qu.:0.6000     
##  Median :0.03899            Median :0.015332           Median :0.7108     
##  Mean   :0.03959            Mean   :0.016580           Mean   :0.6826     
##  3rd Qu.:0.05017            3rd Qu.:0.021696           3rd Qu.:0.8000     
##  Max.   :0.15549            Max.   :0.162037           Max.   :1.0000     
##  rate_negative_words avg_positive_polarity min_positive_polarity
##  Min.   :0.0000      Min.   :0.0000        Min.   :0.00000      
##  1st Qu.:0.1857      1st Qu.:0.3063        1st Qu.:0.05000      
##  Median :0.2800      Median :0.3591        Median :0.10000      
##  Mean   :0.2877      Mean   :0.3543        Mean   :0.09551      
##  3rd Qu.:0.3824      3rd Qu.:0.4117        3rd Qu.:0.10000      
##  Max.   :1.0000      Max.   :1.0000        Max.   :1.00000      
##  max_positive_polarity avg_negative_polarity min_negative_polarity
##  Min.   :0.0000        Min.   :-1.0000       Min.   :-1.0000      
##  1st Qu.:0.6000        1st Qu.:-0.3283       1st Qu.:-0.7000      
##  Median :0.8000        Median :-0.2538       Median :-0.5000      
##  Mean   :0.7572        Mean   :-0.2596       Mean   :-0.5231      
##  3rd Qu.:1.0000        3rd Qu.:-0.1872       3rd Qu.:-0.3000      
##  Max.   :1.0000        Max.   : 0.0000       Max.   : 0.0000      
##  max_negative_polarity title_subjectivity title_sentiment_polarity
##  Min.   :-1.0000       Min.   :0.0000     Min.   :-1.00000        
##  1st Qu.:-0.1250       1st Qu.:0.0000     1st Qu.: 0.00000        
##  Median :-0.1000       Median :0.1500     Median : 0.00000        
##  Mean   :-0.1072       Mean   :0.2832     Mean   : 0.07293        
##  3rd Qu.:-0.0500       3rd Qu.:0.5000     3rd Qu.: 0.15000        
##  Max.   : 0.0000       Max.   :1.0000     Max.   : 1.00000        
##  abs_title_subjectivity abs_title_sentiment_polarity targetVar
##  Min.   :0.0000         Min.   :0.000                0:12943  
##  1st Qu.:0.1667         1st Qu.:0.000                1:14808  
##  Median :0.5000         Median :0.000                         
##  Mean   :0.3410         Mean   :0.158                         
##  3rd Qu.:0.5000         3rd Qu.:0.250                         
##  Max.   :0.5000         Max.   :1.000

2.a.v) Summarize the levels of the class attribute.

cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##    freq percentage
## 0 12943   46.63976
## 1 14808   53.36024

2.a.vi) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##                n_tokens_title              n_tokens_content 
##                             0                             0 
##               n_unique_tokens              n_non_stop_words 
##                             0                             0 
##      n_non_stop_unique_tokens                     num_hrefs 
##                             0                             0 
##                num_self_hrefs                      num_imgs 
##                             0                             0 
##                    num_videos          average_token_length 
##                             0                             0 
##                  num_keywords     data_channel_is_lifestyle 
##                             0                             0 
## data_channel_is_entertainment           data_channel_is_bus 
##                             0                             0 
##        data_channel_is_socmed          data_channel_is_tech 
##                             0                             0 
##         data_channel_is_world                    kw_min_min 
##                             0                             0 
##                    kw_max_min                    kw_avg_min 
##                             0                             0 
##                    kw_min_max                    kw_max_max 
##                             0                             0 
##                    kw_avg_max                    kw_min_avg 
##                             0                             0 
##                    kw_max_avg                    kw_avg_avg 
##                             0                             0 
##     self_reference_min_shares     self_reference_max_shares 
##                             0                             0 
##    self_reference_avg_sharess             weekday_is_monday 
##                             0                             0 
##            weekday_is_tuesday          weekday_is_wednesday 
##                             0                             0 
##           weekday_is_thursday             weekday_is_friday 
##                             0                             0 
##           weekday_is_saturday             weekday_is_sunday 
##                             0                             0 
##                    is_weekend                        LDA_00 
##                             0                             0 
##                        LDA_01                        LDA_02 
##                             0                             0 
##                        LDA_03                        LDA_04 
##                             0                             0 
##           global_subjectivity     global_sentiment_polarity 
##                             0                             0 
##    global_rate_positive_words    global_rate_negative_words 
##                             0                             0 
##           rate_positive_words           rate_negative_words 
##                             0                             0 
##         avg_positive_polarity         min_positive_polarity 
##                             0                             0 
##         max_positive_polarity         avg_negative_polarity 
##                             0                             0 
##         min_negative_polarity         max_negative_polarity 
##                             0                             0 
##            title_subjectivity      title_sentiment_polarity 
##                             0                             0 
##        abs_title_subjectivity  abs_title_sentiment_polarity 
##                             0                             0 
##                     targetVar 
##                             0

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute.

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

2.b.ii) Multivariate plots to better understand the relationships between attributes

# Scatterplot matrix colored by class
# pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# featurePlot(x=x_train, y=y_train, plot="box", scales=scales)
# Density plots for each attribute by class value
# featurePlot(x=x_train, y=y_train, plot="density", scales=scales)
# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
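Beyond eyeballing the correlation plot, we can list the strongly correlated predictor pairs directly. The sketch below uses a small hypothetical data frame (not the project dataset) to illustrate the idea with base R only:

```r
# Hypothetical follow-up to the correlation plot: list the predictor pairs
# whose absolute correlation exceeds a cutoff (toy data for illustration).
set.seed(888)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.01)   # nearly collinear with a
toy$c <- rnorm(100)                      # independent noise
cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA     # keep each pair only once
high <- which(abs(cm) > 0.9, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, 1]],
           var2 = colnames(cm)[high[, 2]],
           corr = cm[high])
```

On the toy data, only the `a`/`b` pair crosses the 0.9 cutoff; on the real training set, the same pattern applied to `correlations` would flag candidates for removal.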

email_notify(paste("Data Summary and Visualization Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28864e92}"

3. Prepare Data

Some datasets may require additional preparation activities to best expose the structure of the problem and the relationships between the input attributes and the output variable. For this project, the data-prep tasks include data cleaning, feature selection, and data transforms.

3.a) Data Cleaning

# Not applicable for this iteration of the project.

# Mark missing values
#invalid <- 0
#entireDataset$some_col[entireDataset$some_col==invalid] <- NA

# Impute missing values
#entireDataset$some_col <- with(entireDataset, impute(some_col, mean))

3.b) Feature Selection

# Using the Stochastic Gradient Boosting (GBM) algorithm, we try to rank the attributes' importance.
startTimeModule <- proc.time()
set.seed(seedNum)
library(gbm)
## Loaded gbm 2.1.4
model_fs <- train(targetVar~., data=xy_train, method="gbm", preProcess="scale", trControl=control, verbose=F)
rankedImportance <- varImp(model_fs, scale=FALSE)
print(rankedImportance)
## gbm variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                               Overall
## kw_avg_avg                     515.87
## is_weekend                     332.33
## self_reference_min_shares      296.91
## kw_max_avg                     296.83
## data_channel_is_entertainment  261.67
## self_reference_avg_sharess     239.44
## data_channel_is_tech           217.03
## n_unique_tokens                156.52
## kw_min_avg                     153.08
## kw_max_max                     137.80
## LDA_02                         128.57
## data_channel_is_socmed         112.47
## LDA_00                          98.47
## kw_avg_max                      90.27
## kw_avg_min                      86.09
## num_hrefs                       68.24
## LDA_01                          56.44
## data_channel_is_world           54.66
## global_subjectivity             54.18
## n_non_stop_unique_tokens        52.86
plot(rankedImportance)

# Set the importance threshold and identify the attributes that do not contribute to reaching it
maxThreshold <- 0.99
rankedAttributes <- rankedImportance$importance
rankedAttributes <- rankedAttributes[order(-rankedAttributes$Overall),,drop=FALSE]
totalWeight <- sum(rankedAttributes)
i <- 1
accumWeight <- 0
exit_now <- FALSE
while ((i <= totAttr) & !exit_now) {
  accumWeight = accumWeight + rankedAttributes[i,]
  if ((accumWeight/totalWeight) >= maxThreshold) {
    exit_now <- TRUE
  } else {
    i <- i + 1
  }
}
lowImportance <- rankedAttributes[(i+1):(totAttr),,drop=FALSE]
lowAttributes <- rownames(lowImportance)
cat('Number of attributes contributed to the importance threshold:',i,"\n")
## Number of attributes contributed to the importance threshold: 42
cat('Number of attributes found to be of low importance:',length(lowAttributes))
## Number of attributes found to be of low importance: 16
# Removing the unselected attributes from the training and validation dataframes
xy_train <- xy_train[, !(names(xy_train) %in% lowAttributes)]
xy_test <- xy_test[, !(names(xy_test) %in% lowAttributes)]
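The while-loop above can also be expressed in vectorized form with `cumsum`. The sketch below applies the same 99% cumulative-importance logic to a hypothetical importance vector (already sorted in descending order), not the actual GBM rankings:

```r
# Vectorized sketch of the cumulative-importance threshold, using a
# hypothetical importance vector sorted in descending order: retain
# attributes until 99% of the total importance is accumulated.
importance <- c(50, 30, 10, 5, 3, 1, 0.5, 0.5)    # hypothetical weights
cumShare <- cumsum(importance) / sum(importance)  # cumulative fraction of total
nKeep <- which(cumShare >= 0.99)[1]               # first index crossing 0.99
nKeep                                             # 6 attributes retained here
```

Replacing `importance` with `rankedAttributes$Overall` would reproduce the loop's result in one pass.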

3.c) Data Transforms

# Not applicable for this iteration of the project.

3.d) Display the Final Dataset for Model-Building

dim(xy_train)
## [1] 27751    43
sapply(xy_train, class)
##                n_tokens_title              n_tokens_content 
##                     "numeric"                     "numeric" 
##               n_unique_tokens              n_non_stop_words 
##                     "numeric"                     "numeric" 
##      n_non_stop_unique_tokens                     num_hrefs 
##                     "numeric"                     "numeric" 
##                num_self_hrefs                      num_imgs 
##                     "numeric"                     "numeric" 
##                    num_videos          average_token_length 
##                     "numeric"                     "numeric" 
## data_channel_is_entertainment        data_channel_is_socmed 
##                     "numeric"                     "numeric" 
##          data_channel_is_tech         data_channel_is_world 
##                     "numeric"                     "numeric" 
##                    kw_min_min                    kw_max_min 
##                     "numeric"                     "numeric" 
##                    kw_avg_min                    kw_min_max 
##                     "numeric"                     "numeric" 
##                    kw_max_max                    kw_avg_max 
##                     "numeric"                     "numeric" 
##                    kw_min_avg                    kw_max_avg 
##                     "numeric"                     "numeric" 
##                    kw_avg_avg     self_reference_min_shares 
##                     "numeric"                     "numeric" 
##     self_reference_max_shares    self_reference_avg_sharess 
##                     "numeric"                     "numeric" 
##             weekday_is_friday           weekday_is_saturday 
##                     "numeric"                     "numeric" 
##                    is_weekend                        LDA_00 
##                     "numeric"                     "numeric" 
##                        LDA_01                        LDA_02 
##                     "numeric"                     "numeric" 
##                        LDA_03                        LDA_04 
##                     "numeric"                     "numeric" 
##           global_subjectivity    global_rate_positive_words 
##                     "numeric"                     "numeric" 
##           rate_positive_words           rate_negative_words 
##                     "numeric"                     "numeric" 
##         min_positive_polarity         avg_negative_polarity 
##                     "numeric"                     "numeric" 
##      title_sentiment_polarity        abs_title_subjectivity 
##                     "numeric"                     "numeric" 
##                     targetVar 
##                      "factor"
proc.time()-startTimeScript
##    user  system elapsed 
## 283.601   0.972 290.600
email_notify(paste("Data Cleaning and Transformation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@2a18f23c}"

4. Model and Evaluate Algorithms

After the data prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data.

For this project, we will evaluate one linear, three non-linear, and three ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run to ensure that each algorithm is evaluated using exactly the same data splits, which makes the results directly comparable.
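A minimal base-R illustration of why the seed reset matters (toy fold assignment, not the caret resampling used below): the same seed reproduces the same splits, so every algorithm is scored on identical folds.

```r
# Resetting the seed before each run reproduces the same fold assignment,
# so every algorithm is evaluated on identical cross-validation splits.
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(1:k, length.out = n))  # random fold label for each row
}
folds_run1 <- make_folds(n = 100, k = 10, seed = 888)
folds_run2 <- make_folds(n = 100, k = 10, seed = 888)
identical(folds_run1, folds_run2)   # TRUE: identical splits on every run
```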

4.a) Generate models using linear algorithms

# Logistic Regression (Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## (The warning above repeats once for each of the 10 cross-validation folds.)
print(fit.glm)
## Generalized Linear Model 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.655652  0.3058908
proc.time()-startTimeModule
##    user  system elapsed 
##   8.237   0.086   8.420
email_notify(paste("Logistic Regression Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@ea4a92b}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01977903  0.6130594  0.2149499
##   0.03484509  0.6030766  0.1973039
##   0.13636715  0.5678727  0.1069449
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01977903.
proc.time()-startTimeModule
##    user  system elapsed 
##  14.268   0.011  14.434
email_notify(paste("Decision Tree Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4563e9ab}"
# k-Nearest Neighbors (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.5713668  0.1390589
##   7  0.5741413  0.1440273
##   9  0.5741413  0.1434159
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
proc.time()-startTimeModule
##    user  system elapsed 
## 119.877   0.050 121.168
email_notify(paste("k-Nearest Neighbors Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@c818063}"
# Support Vector Machine (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.6586435  0.3096639
##   0.50  0.6618143  0.3166711
##   1.00  0.6648769  0.3236104
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01866281
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01866281 and C = 1.
proc.time()-startTimeModule
##     user   system  elapsed 
## 3067.300   77.111 3180.436
email_notify(paste("Support Vector Machine Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@129a8472}"

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6488416  0.2923625
proc.time()-startTimeModule
##    user  system elapsed 
## 297.788   0.530 301.401
email_notify(paste("Bagged CART Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7c30a502}"
# Random Forest (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6741380  0.3408538
##   22    0.6704982  0.3348646
##   42    0.6682283  0.3307585
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
proc.time()-startTimeModule
##     user   system  elapsed 
## 3181.550   17.184 3230.688
email_notify(paste("Random Forest Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@b684286}"
# Stochastic Gradient Boosting (Regression/Classification)
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.6511111  0.2917341
##   1                  100      0.6592191  0.3101098
##   1                  150      0.6627148  0.3180982
##   2                   50      0.6590028  0.3092768
##   2                  100      0.6645524  0.3220146
##   2                  150      0.6684799  0.3303902
##   3                   50      0.6624622  0.3168486
##   3                  100      0.6677594  0.3287643
##   3                  150      0.6698135  0.3333561
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 180.175   0.306 182.334
email_notify(paste("Stochastic Gradient Boosting Modeling Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@17c68925}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.glm, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.6441961 0.6479279 0.6575394 0.6556520 0.6590090 0.6735135    0
## CART    0.6018018 0.6062514 0.6128626 0.6130594 0.6164865 0.6327928    0
## kNN     0.5625225 0.5654567 0.5756757 0.5741413 0.5826126 0.5834234    0
## SVM     0.6418018 0.6570242 0.6681081 0.6648769 0.6712911 0.6879279    0
## BagCART 0.6313514 0.6465183 0.6469105 0.6488416 0.6499682 0.6663063    0
## RF      0.6601802 0.6672372 0.6703299 0.6741380 0.6801802 0.6944144    0
## GBM     0.6547748 0.6618031 0.6719509 0.6698135 0.6773874 0.6828829    0
## 
## Kappa 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.2813747 0.2908032 0.3093688 0.3058908 0.3131719 0.3424442    0
## CART    0.1882878 0.2040335 0.2113768 0.2149499 0.2270599 0.2535311    0
## kNN     0.1199373 0.1257108 0.1476646 0.1440273 0.1604046 0.1633571    0
## SVM     0.2760263 0.3068133 0.3306548 0.3236104 0.3361399 0.3704285    0
## BagCART 0.2557445 0.2876354 0.2886970 0.2923625 0.2944650 0.3275340    0
## RF      0.3124632 0.3262607 0.3341703 0.3408538 0.3535053 0.3821219    0
## GBM     0.3032725 0.3179184 0.3377054 0.3333561 0.3491367 0.3605639    0
dotplot(results)

cat('The average accuracy from all models is:',
    mean(c(results$values$`LR~Accuracy`,results$values$`CART~Accuracy`,results$values$`kNN~Accuracy`,results$values$`SVM~Accuracy`,results$values$`BagCART~Accuracy`,results$values$`RF~Accuracy`,results$values$`GBM~Accuracy`)))
## The average accuracy from all models is: 0.6429318

5. Improve Accuracy or Results

After we arrive at a short list of machine learning algorithms with a good level of accuracy, we can leverage ways to improve the accuracy of the models.

Using the two best-performing algorithms from the previous section, we will search for the combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

Finally, we will tune the best-performing algorithms from each group further and see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry=c(2,3,4,5))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Random Forest 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.6731648  0.3388394
##   3     0.6748581  0.3426844
##   4     0.6742459  0.3418766
##   5     0.6720836  0.3375595
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
proc.time()-startTimeModule
##     user   system  elapsed 
## 2991.676   19.916 3041.369
email_notify(paste("Algorithm #1 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@96532d6}"
# Tuning algorithm #2 - Stochastic Gradient Boosting
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(.n.trees=c(300,500,700,900), .shrinkage=0.1, .interaction.depth=c(2,3), .n.minobsinnode=10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
plot(fit.final2)

print(fit.final2)
## Stochastic Gradient Boosting 
## 
## 27751 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 24976, 24976, 24977, 24976, 24976, 24975, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   2                  300      0.6718675  0.3377429
##   2                  500      0.6728763  0.3401466
##   2                  700      0.6748220  0.3441818
##   2                  900      0.6746781  0.3439168
##   3                  300      0.6726242  0.3397446
##   3                  500      0.6751105  0.3446334
##   3                  700      0.6746782  0.3438693
##   3                  900      0.6741737  0.3429761
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
## 807.642   0.102 815.714
email_notify(paste("Algorithm #2 Tuning Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1554909b}"

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.6576577 0.6663359 0.6731532 0.6748581 0.6770241 0.7005405    0
## GBM 0.6590991 0.6719200 0.6749550 0.6751105 0.6811712 0.6915315    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.3071212 0.3253198 0.3396705 0.3426844 0.3472923 0.3954288    0
## GBM 0.3115274 0.3389015 0.3444874 0.3446334 0.3575708 0.3787939    0
dotplot(results)

6. Finalize Model and Present Results

Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. Finalizing a model may involve sub-tasks such as making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.

6.a) Predictions on validation dataset

predictions <- predict(fit.final2, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3380 1814
##          1 2167 4532
##                                           
##                Accuracy : 0.6653          
##                  95% CI : (0.6567, 0.6737)
##     No Information Rate : 0.5336          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3248          
##  Mcnemar's Test P-Value : 2.421e-08       
##                                           
##             Sensitivity : 0.6093          
##             Specificity : 0.7142          
##          Pos Pred Value : 0.6508          
##          Neg Pred Value : 0.6765          
##              Prevalence : 0.4664          
##          Detection Rate : 0.2842          
##    Detection Prevalence : 0.4367          
##       Balanced Accuracy : 0.6617          
##                                           
##        'Positive' Class : 0               
## 
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.6617445
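Note that the ROC curve above is built from hard 0/1 class labels rather than predicted probabilities, which is why the AUC (0.6617) equals the balanced accuracy from the confusion matrix. As a self-contained sketch (hypothetical scores, not model output), AUC can be computed directly from continuous scores via the rank-sum identity:

```r
# Self-contained sketch: AUC from scores via the rank-sum (Mann-Whitney)
# identity. With hard 0/1 predictions, as above, AUC reduces to balanced
# accuracy; continuous class probabilities give a more informative curve.
auc_score <- function(scores, labels) {
  r <- rank(scores)                           # ranks of all scores
  nPos <- sum(labels == 1)
  nNeg <- sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}
labels <- c(0, 0, 1, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.90)     # hypothetical probabilities
auc_score(scores, labels)                     # 0.8333...
```

Passing class probabilities (e.g. from `predict(..., type = "prob")`) to the ROCR `prediction()` call would have the same effect on the curve above.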

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(gbm)
set.seed(seedNum)

# Combining the training and test datasets to form the original dataset that will be used for training the final model
xy_train <- rbind(xy_train, xy_test)

#finalModel <- gbm(targetVar ~ ., data = xy_train, n.trees=700, verbose=F)
grid <- expand.grid(.n.trees=500, .shrinkage=0.1, .interaction.depth=3, .n.minobsinnode=10)
finalModel <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
print(finalModel)
## Stochastic Gradient Boosting 
## 
## 39644 samples
##    42 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 35679, 35679, 35680, 35680, 35680, 35680, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6728129  0.3396459
## 
## Tuning parameter 'n.trees' was held constant at a value of 500
## Tuning parameter 'interaction.depth' was held constant at a value of
##  3
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
proc.time()-startTimeModule
##    user  system elapsed 
## 418.730   0.028 422.976
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@668bc3d5}"

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
proc.time()-startTimeScript
##      user    system   elapsed 
## 11377.265   116.496 11635.195